INTERSPEECH.2019 - Speech Recognition

Total: 198

#1 Very Deep Self-Attention Networks for End-to-End Speech Recognition [PDF] [Copy] [Kimi1]

Authors: Ngoc-Quan Pham ; Thai-Son Nguyen ; Jan Niehues ; Markus Müller ; Alex Waibel

Recently, end-to-end sequence-to-sequence models for speech recognition have gained significant interest in the research community. While previous architecture choices revolve around time-delay neural networks (TDNN) and long short-term memory (LSTM) recurrent neural networks, we propose to use self-attention via the Transformer architecture as an alternative. Our analysis shows that deep Transformer networks with high learning capacity are able to exceed the performance of previous end-to-end approaches and even match conventional hybrid systems. Moreover, we trained very deep models with up to 48 Transformer layers for both the encoder and decoder, combined with stochastic residual connections, which greatly improve generalizability and training efficiency. The resulting models outperform all previous end-to-end ASR approaches on the Switchboard benchmark. An ensemble of these models achieves 9.9% and 17.7% WER on the Switchboard and CallHome test sets respectively. This finding brings our end-to-end models to competitive levels with previous hybrid systems. Further, with model ensembling the Transformers can outperform certain hybrid systems, which are more complicated in terms of both structure and training procedure.
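
As a rough illustration of the stochastic residual idea described above (not the authors' exact implementation), the sketch below shows a single Transformer encoder layer, assuming PyTorch, in which each residual branch is randomly skipped during training; module names and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class StochasticTransformerEncoderLayer(nn.Module):
    """Transformer encoder layer whose residual branches are randomly
    skipped during training (stochastic residual connections).
    Simplification: at inference all branches are kept without rescaling."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048, drop_layer=0.2):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop_layer = drop_layer  # probability of skipping a sub-layer

    def _keep(self):
        # During training, keep each residual branch with prob (1 - drop_layer).
        return (not self.training) or torch.rand(1).item() > self.drop_layer

    def forward(self, x):  # x: (time, batch, d_model)
        if self._keep():
            attn_out, _ = self.self_attn(x, x, x)
            x = x + attn_out
        x = self.norm1(x)
        if self._keep():
            x = x + self.ff(x)
        return self.norm2(x)
```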

#2 Jasper: An End-to-End Convolutional Neural Acoustic Model [PDF] [Copy] [Kimi1]

Authors: Jason Li ; Vitaly Lavrukhin ; Boris Ginsburg ; Ryan Leary ; Oleksii Kuchaiev ; Jonathan M. Cohen ; Huyen Nguyen ; Ravi Teja Gadde

In this paper we report state-of-the-art results on LibriSpeech among end-to-end speech recognition models without any external training data. Our model, Jasper, uses only 1D convolutions, batch normalization, ReLU, dropout, and residual connections. To improve training, we further introduce a new layer-wise optimizer called NovoGrad. Through experiments, we demonstrate that the proposed deep architecture performs as well or better than more complex choices. Our deepest Jasper variant uses 54 convolutional layers. With this architecture, we achieve 2.95% WER using a beam-search decoder with an external neural language model and 3.86% WER with a greedy decoder on LibriSpeech test-clean. We also report competitive results on Wall Street Journal and the Hub5’00 conversational evaluation datasets.
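
The building block named in the abstract (1D convolution, batch normalization, ReLU, dropout, residual connections) can be sketched roughly as follows, assuming PyTorch; the exact Jasper block layout (number of repetitions, where the residual is added) differs in the paper, so treat this as an illustrative simplification.

```python
import torch
import torch.nn as nn

class JasperBlock(nn.Module):
    """Residual block in the spirit of Jasper: `repeat` repetitions of
    (1D conv -> batch norm -> ReLU -> dropout) plus a 1x1-projected
    residual connection. Simplified relative to the paper's block."""

    def __init__(self, in_ch, out_ch, kernel=11, repeat=5, dropout=0.2):
        super().__init__()
        layers, ch = [], in_ch
        for _ in range(repeat):
            # odd kernel + same padding keeps the time dimension unchanged
            layers += [nn.Conv1d(ch, out_ch, kernel, padding=kernel // 2),
                       nn.BatchNorm1d(out_ch), nn.ReLU(), nn.Dropout(dropout)]
            ch = out_ch
        self.body = nn.Sequential(*layers)
        # 1x1 conv so the residual matches the block's output channels
        self.residual = nn.Sequential(nn.Conv1d(in_ch, out_ch, 1),
                                      nn.BatchNorm1d(out_ch))

    def forward(self, x):  # x: (batch, channels, time)
        return torch.relu(self.body(x) + self.residual(x))
```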

#3 Unidirectional Neural Network Architectures for End-to-End Automatic Speech Recognition [PDF] [Copy] [Kimi1]

Authors: Niko Moritz ; Takaaki Hori ; Jonathan Le Roux

In hybrid automatic speech recognition (ASR) systems, neural networks are used as acoustic models (AMs) to recognize phonemes that are composed into words and sentences using pronunciation dictionaries, hidden Markov models, and language models, which can be jointly represented by a weighted finite state transducer (WFST). The importance of capturing temporal context by an AM has been studied and discussed in prior work. In an end-to-end ASR system, however, all components are merged into a single neural network, i.e., the breakdown into an AM and the different parts of the WFST model is no longer possible. This implies that end-to-end neural network architectures have even stronger requirements for processing long contextual information. Bidirectional long short-term memory (BLSTM) neural networks have demonstrated state-of-the-art results in end-to-end ASR but are unsuitable for streaming applications. Latency-controlled BLSTMs account for this by limiting the future context seen by the backward-directed recurrence using chunk-wise processing. In this paper, we propose two new unidirectional neural network architectures, the time-delay LSTM (TDLSTM) and the parallel time-delayed LSTM (PTDLSTM) streams, which both limit the processing latency to a fixed size and demonstrate significant improvements compared to prior art on a variety of ASR tasks.
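
The abstract does not spell out the TDLSTM/PTDLSTM designs; as a hedged illustration of the general idea of bounding future context, the PyTorch sketch below (with made-up layer sizes) combines a temporal convolution that sees a fixed number of future frames with a purely unidirectional LSTM, so the algorithmic latency stays fixed.

```python
import torch
import torch.nn as nn

class BoundedFutureEncoder(nn.Module):
    """Unidirectional encoder with fixed look-ahead: a temporal convolution
    sees `future` frames ahead of the current one, then an LSTM runs forward
    only, so total algorithmic latency is bounded by `future` frames."""

    def __init__(self, feat_dim=80, hidden=320, future=5):
        super().__init__()
        # kernel spans [t - future, ..., t + future]; latency = future frames
        self.tdnn = nn.Conv1d(feat_dim, hidden,
                              kernel_size=2 * future + 1, padding=future)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)

    def forward(self, feats):  # feats: (batch, time, feat_dim)
        x = self.tdnn(feats.transpose(1, 2)).transpose(1, 2)
        out, _ = self.lstm(torch.relu(x))
        return out
```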

#4 Analyzing Phonetic and Graphemic Representations in End-to-End Automatic Speech Recognition [PDF] [Copy] [Kimi1]

Authors: Yonatan Belinkov ; Ahmed Ali ; James Glass

End-to-end neural network systems for automatic speech recognition (ASR) are trained from acoustic features to text transcriptions. In contrast to modular ASR systems, which contain separately-trained components for acoustic modeling, pronunciation lexicon, and language modeling, the end-to-end paradigm is conceptually simpler and has the potential benefit of training the entire system on the end task. However, such neural network models are more opaque: it is not clear how to interpret the role of different parts of the network and what information the network learns during training. In this paper, we analyze the learned internal representations in an end-to-end ASR model. We evaluate the representation quality in terms of several classification tasks, comparing phonemes and graphemes, as well as different articulatory features. We study two languages (English and Arabic) and three datasets, finding remarkable consistency in how different properties are represented in different layers of the deep neural network.
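
A common way to carry out this kind of analysis is a probing classifier: freeze the ASR model, extract per-frame activations from one layer, and train a small classifier to predict phonemes or articulatory features. The sketch below assumes PyTorch and a hypothetical `frozen_encoder(frames, layer=...)` accessor; it is not the authors' code.

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Frame-level linear classifier probing what a frozen ASR layer encodes."""
    def __init__(self, feat_dim, n_classes):
        super().__init__()
        self.out = nn.Linear(feat_dim, n_classes)

    def forward(self, frames):  # frames: (n_frames, feat_dim)
        return self.out(frames)

def probe_layer(frozen_encoder, layer_idx, frames, labels, n_classes, epochs=5):
    """Train a probe on activations of one encoder layer; higher accuracy
    suggests the layer encodes the probed property (e.g. phonemes) well.
    `frozen_encoder(frames, layer=...)` is a hypothetical accessor.
    In practice, evaluate on a held-out split rather than the training frames."""
    with torch.no_grad():
        acts = frozen_encoder(frames, layer=layer_idx)  # (n_frames, d)
    probe = LinearProbe(acts.size(-1), n_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(probe(acts), labels)
        loss.backward()
        opt.step()
    acc = (probe(acts).argmax(dim=-1) == labels).float().mean()
    return acc.item()
```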

#5 Untranscribed Web Audio for Low Resource Speech Recognition [PDF] [Copy] [Kimi1]

Authors: Andrea Carmantini ; Peter Bell ; Steve Renals

Speech recognition models are highly susceptible to mismatch in the acoustic and language domains between the training and the evaluation data. For low resource languages, it is difficult to obtain transcribed speech for target domains, while untranscribed data can be collected with minimal effort. Recently, a method applying lattice-free maximum mutual information (LF-MMI) to untranscribed data has been found to be effective for semi-supervised training. However, weaker initial models and domain mismatch can result in high deletion rates for the semi-supervised model. Therefore, we propose a method to force the base model to overgenerate possible transcriptions, relying on the ability of LF-MMI to deal with uncertainty. On data from the IARPA MATERIAL programme, our new semi-supervised method outperforms the standard semi-supervised method, yielding significant gains when adapting for mismatched bandwidth and domain.

#6 RWTH ASR Systems for LibriSpeech: Hybrid vs Attention [PDF] [Copy] [Kimi1]

Authors: Christoph Lüscher ; Eugen Beck ; Kazuki Irie ; Markus Kitza ; Wilfried Michel ; Albert Zeyer ; Ralf Schlüter ; Hermann Ney

We present state-of-the-art automatic speech recognition (ASR) systems employing a standard hybrid DNN/HMM architecture compared to an attention-based encoder-decoder design for the LibriSpeech task. Detailed descriptions of the system development, including model design, pretraining schemes, training schedules, and optimization approaches are provided for both system architectures. Both hybrid DNN/HMM and attention-based systems employ bi-directional LSTMs for acoustic modeling/encoding. For language modeling, we employ both LSTM and Transformer based architectures. All our systems are built using RWTH’s open-source toolkits RASR and RETURNN. To the best of the authors’ knowledge, the results obtained when training on the full LibriSpeech training set are currently the best published, both for the hybrid DNN/HMM and the attention-based systems. Our single hybrid system even outperforms previous results obtained from combining eight single systems. Our comparison shows that on the LibriSpeech 960h task, the hybrid DNN/HMM system outperforms the attention-based system by 15% relative on the clean and 40% relative on the other test sets in terms of word error rate. Moreover, experiments on a reduced 100h subset of the LibriSpeech training corpus show an even more pronounced margin between the hybrid DNN/HMM and attention-based architectures.

#7 Auxiliary Interference Speaker Loss for Target-Speaker Speech Recognition [PDF] [Copy] [Kimi1]

Authors: Naoyuki Kanda ; Shota Horiguchi ; Ryoichi Takashima ; Yusuke Fujita ; Kenji Nagamatsu ; Shinji Watanabe

In this paper, we propose a novel auxiliary loss function for target-speaker automatic speech recognition (ASR). Our method automatically extracts and transcribes the target speaker’s utterances from a monaural mixture of multiple speakers’ speech, given a short sample of the target speaker. The proposed auxiliary loss function attempts to additionally maximize interference speaker ASR accuracy during training. This regularizes the network to achieve a better representation for speaker separation, thus achieving better accuracy on the target-speaker ASR. We evaluated our proposed method using two-speaker-mixed speech in various signal-to-interference-ratio conditions. We first built a strong target-speaker ASR baseline based on state-of-the-art lattice-free maximum mutual information (LF-MMI) training. This baseline achieved a word error rate (WER) of 18.06% on the test set, while a normal ASR model trained with clean data produced a completely corrupted result (WER of 84.71%). Our proposed loss then further reduced the WER by 6.6% relative to this strong baseline, achieving a WER of 16.87%. In addition to the accuracy improvement, we also showed that the auxiliary output branch for the proposed loss can even be used as a secondary ASR for interference speakers’ speech.
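
One plausible realization of the auxiliary interference-speaker loss, shown below as a hedged PyTorch sketch, adds a second output head on the shared encoder and mixes its loss into the training objective; cross-entropy stands in for the LF-MMI objective used in the paper, and all layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class TargetSpeakerASR(nn.Module):
    """Shared mixture encoder conditioned on a target-speaker embedding, with a
    main output branch for the target speaker and an auxiliary branch for the
    interference speaker (used as a training-time regularizer)."""

    def __init__(self, feat_dim=80, spk_dim=128, hidden=512, n_out=2000):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim + spk_dim, hidden, num_layers=3,
                               batch_first=True)
        self.target_head = nn.Linear(hidden, n_out)
        self.interf_head = nn.Linear(hidden, n_out)  # auxiliary branch

    def forward(self, mixture, spk_emb):
        # mixture: (B, T, feat_dim); spk_emb: (B, spk_dim) from a short enrollment sample
        spk = spk_emb.unsqueeze(1).expand(-1, mixture.size(1), -1)
        h, _ = self.encoder(torch.cat([mixture, spk], dim=-1))
        return self.target_head(h), self.interf_head(h)

def joint_loss(target_logits, interf_logits, target_labels, interf_labels, alpha=0.3):
    """Main target-speaker loss plus a weighted auxiliary interference-speaker loss
    (frame-level cross-entropy stands in for LF-MMI)."""
    ce = nn.functional.cross_entropy
    main = ce(target_logits.transpose(1, 2), target_labels)
    aux = ce(interf_logits.transpose(1, 2), interf_labels)
    return main + alpha * aux
```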

#8 Speaker Adaptation for Attention-Based End-to-End Speech Recognition [PDF] [Copy] [Kimi1]

Authors: Zhong Meng ; Yashesh Gaur ; Jinyu Li ; Yifan Gong

We propose three regularization-based speaker adaptation approaches to adapt the attention-based encoder-decoder (AED) model with very limited adaptation data from target speakers for end-to-end automatic speech recognition. The first method is Kullback-Leibler divergence (KLD) regularization, in which the output distribution of a speaker-dependent (SD) AED is forced to be close to that of the speaker-independent (SI) model by adding a KLD regularization term to the adaptation criterion. To compensate for the asymmetric deficiency in KLD regularization, an adversarial speaker adaptation (ASA) method is proposed to regularize the deep-feature distribution of the SD AED through the adversarial learning of an auxiliary discriminator and the SD AED. The third approach is multi-task learning, in which an SD AED is trained to jointly perform the primary task of predicting a large number of output units and an auxiliary task of predicting a small number of output units to alleviate the target sparsity issue. Evaluated on a Microsoft short message dictation task, all three methods are highly effective in adapting the AED model, achieving up to 12.2% and 3.0% word error rate improvement over an SI AED trained on 3400 hours of data for supervised and unsupervised adaptation, respectively.
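
The KLD-regularized adaptation criterion of the first method can be sketched as follows: a minimal PyTorch version, assuming frame-level logits from the adapted SD model and the frozen SI model; the interpolation weight `rho` is illustrative.

```python
import torch
import torch.nn.functional as F

def kld_adaptation_loss(sd_logits, si_logits, labels, rho=0.5):
    """Adaptation loss = (1 - rho) * CE(SD, labels) + rho * KL(SI || SD).
    sd_logits: adapted (speaker-dependent) model outputs;
    si_logits: frozen speaker-independent model outputs on the same frames;
    rho controls how strongly the SD output is pulled toward the SI output."""
    ce = F.cross_entropy(sd_logits, labels)
    kld = F.kl_div(F.log_softmax(sd_logits, dim=-1),
                   F.softmax(si_logits.detach(), dim=-1),
                   reduction="batchmean")
    return (1.0 - rho) * ce + rho * kld
```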

#9 Large Margin Training for Attention Based End-to-End Speech Recognition [PDF] [Copy] [Kimi1]

Authors: Peidong Wang ; Jia Cui ; Chao Weng ; Dong Yu

End-to-end speech recognition systems are typically evaluated using the maximum a posteriori criterion. Since only one hypothesis is involved during evaluation, the ideal number of hypotheses for training should also be one. In this study, we propose a large margin training scheme for attention-based end-to-end speech recognition. Using only one training hypothesis, the large margin training strategy achieves the same performance as the minimum word error rate criterion using four hypotheses. The theoretical derivation in this study is widely applicable to other sequence discriminative criteria such as maximum mutual information. In addition, this paper provides a more succinct formulation of the large margin concept, paving the way towards a better combination of support vector machines and deep neural networks.
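
A generic sequence-level large-margin (hinge) objective over a single competing hypothesis, in the spirit of the abstract but not the paper's exact formulation, might look like the following PyTorch sketch.

```python
import torch

def large_margin_loss(ref_logprob, hyp_logprob, hyp_errors, margin_scale=1.0):
    """Hinge-style sequence loss: the reference should outscore the competing
    hypothesis by a margin proportional to that hypothesis' word errors.
    ref_logprob, hyp_logprob: (B,) sequence log-probabilities from the model;
    hyp_errors: (B,) word-error counts of the competing hypotheses."""
    margin = margin_scale * hyp_errors
    return torch.clamp(margin - (ref_logprob - hyp_logprob), min=0.0).mean()
```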

#10 Large-Scale Mixed-Bandwidth Deep Neural Network Acoustic Modeling for Automatic Speech Recognition [PDF] [Copy] [Kimi1]

Authors: Khoi-Nguyen C. Mac ; Xiaodong Cui ; Wei Zhang ; Michael Picheny

In automatic speech recognition (ASR), wideband (WB) and narrowband (NB) speech signals with different sampling rates typically use separate acoustic models. Therefore, mixed-bandwidth (MB) acoustic modeling has important practical value for ASR system deployment. In this paper, we extensively investigate large-scale MB deep neural network acoustic modeling for ASR using 1,150 hours of WB data and 2,300 hours of NB data. We study various MB strategies including downsampling, upsampling and bandwidth extension for MB acoustic modeling and evaluate their performance on 8 diverse WB and NB test sets from various application domains. To deal with the large amounts of training data, distributed training is carried out on multiple GPUs using synchronous data parallelism.

#11 SparseSpeech: Unsupervised Acoustic Unit Discovery with Memory-Augmented Sequence Autoencoders [PDF] [Copy] [Kimi1]

Authors: Benjamin Milde ; Chris Biemann

We propose a sparse sequence autoencoder model for unsupervised acoustic unit discovery, based on bidirectional LSTM encoders/decoders with a sparsity-inducing bottleneck. The sparsity layer is based on memory-augmented neural networks, with a differentiable embedding memory bank addressed from the encoder. The decoder reconstructs the encoded input feature sequence from an utterance-level context embedding and the bottleneck representation. At some time steps, the input to the decoder is randomly omitted by applying sequence dropout, forcing the decoder to learn about the temporal structure of the sequence. We propose a bootstrapping training procedure, after which the network can be trained end-to-end with standard back-propagation. Sparsity of the generated representation can be controlled with a parameter in the proposed loss function. We evaluate the units with the ABX discriminability on minimal triphone pairs and also on entire words. Forcing the network to favor highly sparse memory addressings in the memory component yields symbolic-like representations of speech that are very compact and still offer better ABX discriminability than MFCC.
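
The memory-addressed sparsity bottleneck can be pictured roughly as below: encoder states address a learnable memory bank with softmax weights, and an entropy term (added to the loss with some scale) pushes the addressing toward near one-hot, symbolic-like codes. This is a hedged PyTorch sketch with illustrative sizes, not the SparseSpeech implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryBottleneck(nn.Module):
    """Differentiable memory addressing: each encoder state is replaced by a
    convex combination of memory entries; an entropy penalty encourages the
    addressing weights to become sparse (near one-hot, i.e. unit-like)."""

    def __init__(self, enc_dim=256, n_units=100, mem_dim=64):
        super().__init__()
        self.query = nn.Linear(enc_dim, mem_dim)
        self.memory = nn.Parameter(torch.randn(n_units, mem_dim))

    def forward(self, enc):  # enc: (B, T, enc_dim)
        weights = F.softmax(self.query(enc) @ self.memory.t(), dim=-1)  # (B, T, n_units)
        bottleneck = weights @ self.memory  # (B, T, mem_dim)
        # mean addressing entropy; adding it (scaled) to the loss promotes sparsity
        sparsity = -(weights * torch.log(weights + 1e-8)).sum(dim=-1).mean()
        return bottleneck, weights, sparsity
```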

#12 Bayesian Subspace Hidden Markov Model for Acoustic Unit Discovery [PDF] [Copy] [Kimi1]

Authors: Lucas Ondel ; Hari Krishna Vydana ; Lukáš Burget ; Jan Černocký

This work tackles the problem of learning a set of language-specific acoustic units from unlabeled speech recordings, given a set of labeled recordings from other languages. Our approach can be described as a two-step procedure: first, the model learns the notion of acoustic units from the labeled data; then, it uses this knowledge to find new acoustic units in the target language. We implement this process with the Bayesian Subspace Hidden Markov Model (SHMM), a model akin to the Subspace Gaussian Mixture Model (SGMM) where each low-dimensional embedding represents an acoustic unit rather than just an HMM state. The subspace is trained on 3 languages from the GlobalPhone corpus (German, Polish and Spanish) and the acoustic units are discovered on the TIMIT corpus. Results, measured in equivalent Phone Error Rate, show that this approach significantly outperforms previous HMM-based acoustic unit discovery systems and compares favorably with the Variational Autoencoder-HMM.

#13 Speaker Adversarial Training of DPGMM-Based Feature Extractor for Zero-Resource Languages [PDF] [Copy] [Kimi1]

Authors: Yosuke Higuchi ; Naohiro Tawara ; Tetsunori Kobayashi ; Tetsuji Ogawa

We propose a novel framework for extracting speaker-invariant features for zero-resource languages. A deep neural network (DNN)-based acoustic model is normalized against speakers via adversarial training: a multi-task learning process trains a shared bottleneck feature to be discriminative to phonemes and independent of speakers. However, owing to the absence of phoneme labels, zero-resource languages cannot employ adversarial multi-task (AMT) learning for speaker normalization. In this work, we obtain a posteriorgram from a Dirichlet process Gaussian mixture model (DPGMM) and utilize the posterior vector as supervision for the phoneme estimation in the AMT training. The AMT network is designed so that the DPGMM posteriorgram itself is embedded in a speaker-invariant feature space. The proposed network is expected to resolve the potential problem that the posteriorgram may lack reliability as a phoneme representation if the DPGMM components are intermingled with phoneme and speaker information. Based on the Zero Resource Speech Challenges, we conduct phoneme discriminability experiments on the extracted features. The results show that the proposed framework extracts discriminative features while suppressing speaker variability.
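
Adversarial multi-task training of this kind is often implemented with a gradient reversal layer: the bottleneck is trained to predict the DPGMM posteriors while the speaker branch's gradient is flipped so the extractor removes speaker information. The PyTorch sketch below is a generic illustration under that assumption, not the authors' network.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips (and scales) gradients in backward,
    so the feature extractor is trained to confuse the speaker classifier."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x
    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None

class AMTNetwork(nn.Module):
    def __init__(self, feat_dim=39, bn_dim=40, n_dpgmm=300, n_spk=100, lam=0.5):
        super().__init__()
        self.lam = lam
        self.extractor = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(),
                                       nn.Linear(512, bn_dim))
        self.phone_head = nn.Linear(bn_dim, n_dpgmm)  # supervised by DPGMM posteriors
        self.spk_head = nn.Linear(bn_dim, n_spk)      # adversarial speaker branch

    def forward(self, x):
        bn = self.extractor(x)
        phone_logits = self.phone_head(bn)
        spk_logits = self.spk_head(GradReverse.apply(bn, self.lam))
        return bn, phone_logits, spk_logits
```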

#14 Building Large-Vocabulary ASR Systems for Languages Without Any Audio Training Data [PDF] [Copy] [Kimi1]

Authors: Manasa Prasad ; Daan van Esch ; Sandy Ritchie ; Jonas Fromseier Mortensen

When building automatic speech recognition (ASR) systems, typically some amount of audio and text data in the target language is needed. While text data can be obtained relatively easily across many languages, transcribed audio data is challenging to obtain. This presents a barrier to making voice technologies available in more languages of the world. In this paper, we present a way to build an ASR system system for a language even in the absence of any audio training data in that language at all. We do this by simply re-using an existing acoustic model from a phonologically similar language, without any kind of modification or adaptation towards the target language. The basic insight is that, if two languages are sufficiently similar in terms of their phonological system, an acoustic model should hold up relatively well when used for another language. We describe how we tailor our pronunciation models to enable such re-use, and show experimental results across a number of languages from various language families. We also provide a theoretical analysis of situations in which this approach is likely to work. Our results show that it is possible to achieve less than 20% word error rate (WER) using this method.

#15 Towards Bilingual Lexicon Discovery From Visually Grounded Speech Audio [PDF] [Copy] [Kimi]

Authors: Emmanuel Azuh ; David Harwath ; James Glass

In this paper, we present a method for the discovery of word-like units and their approximate translations from visually grounded speech across multiple languages. We first train a neural network model to map images and their spoken audio captions in both English and Hindi to a shared, multimodal embedding space. Next, we use this model to segment and cluster regions of the spoken captions which approximately correspond to words. Finally, we exploit between-cluster similarities in the embedding space to associate English pseudo-word clusters with Hindi pseudo-word clusters, and show that many of these cluster pairings capture semantic translations between English and Hindi words. We present quantitative cross-lingual clustering results, as well as qualitative results in the form of a bilingual picture dictionary.

#16 Improving Unsupervised Subword Modeling via Disentangled Speech Representation Learning and Transformation [PDF] [Copy] [Kimi1]

Authors: Siyuan Feng ; Tan Lee

This study tackles unsupervised subword modeling in the zero-resource scenario, learning frame-level speech representations that are phonetically discriminative and speaker-invariant, using only untranscribed speech for target languages. Frame label acquisition is an essential step in solving this problem. High quality frame labels should be consistent with gold-standard transcriptions and robust to speaker variation. We propose to improve frame label acquisition in our previously adopted deep neural network-bottleneck feature (DNN-BNF) architecture by applying the factorized hierarchical variational autoencoder (FHVAE). FHVAEs learn to disentangle linguistic content and speaker identity information encoded in speech. By discarding or unifying speaker information, speaker-invariant features are learned and fed as inputs to DPGMM frame clustering and DNN-BNF training. Experiments conducted on ZeroSpeech 2017 show that our proposed approaches achieve 2.4% and 0.6% absolute ABX error rate reductions in across- and within-speaker conditions, compared to the baseline DNN-BNF system without FHVAEs. Our proposed approaches significantly outperform vocal tract length normalization in improving frame labeling and subword modeling.

#17 Examining the Combination of Multi-Band Processing and Channel Dropout for Robust Speech Recognition [PDF] [Copy] [Kimi1]

Authors: György Kovács ; László Tóth ; Dirk Van Compernolle ; Marcus Liwicki

A pivotal question in Automatic Speech Recognition (ASR) is the robustness of the trained models. In this study, we investigate the combination of two methods commonly applied to increase the robustness of ASR systems. On the one hand, inspired by auditory experiments and signal processing considerations, multi-band processing has been used for decades to improve the noise robustness of speech recognition. On the other hand, dropout is a commonly used regularization technique to prevent overfitting by keeping the model from becoming over-reliant on a small set of neurons. We hypothesize that a careful combination of the two approaches leads to increased robustness, by preventing the resulting model from over-relying on any given band. To verify our hypothesis, we investigate various approaches for combining the two methods using the Aurora-4 corpus. The results obtained corroborate our initial assumption, and show that the proper combination of the two techniques leads to increased robustness and to significantly lower word error rates (WERs). Furthermore, we find that the accuracy scores attained here compare favourably to those reported recently on the clean training scenario of the Aurora-4 corpus.
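
One simple way to combine multi-band processing with a dropout-style regularizer, sketched below in PyTorch under illustrative assumptions, is to give each frequency band its own stream and randomly zero whole streams during training so the model cannot over-rely on any single band; the paper's exact combination strategies may differ.

```python
import torch
import torch.nn as nn

class MultiBandWithBandDropout(nn.Module):
    """Each frequency band gets its own small stream; during training whole
    streams are zeroed at random so the classifier cannot over-rely on any band."""

    def __init__(self, n_mels=40, n_bands=4, hidden=128, n_out=2000, p_drop=0.25):
        super().__init__()
        assert n_mels % n_bands == 0
        self.n_bands, self.p_drop = n_bands, p_drop
        band_dim = n_mels // n_bands
        self.streams = nn.ModuleList(
            nn.Sequential(nn.Linear(band_dim, hidden), nn.ReLU())
            for _ in range(n_bands))
        self.merge = nn.Linear(n_bands * hidden, n_out)

    def forward(self, feats):  # feats: (B, T, n_mels)
        bands = feats.chunk(self.n_bands, dim=-1)
        outs = []
        for stream, band in zip(self.streams, bands):
            h = stream(band)
            if self.training and torch.rand(1).item() < self.p_drop:
                h = torch.zeros_like(h)  # drop this whole band stream
            outs.append(h)
        return self.merge(torch.cat(outs, dim=-1))
```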

#18 Label Driven Time-Frequency Masking for Robust Continuous Speech Recognition [PDF] [Copy] [Kimi1]

Authors: Meet Soni ; Ashish Panda

The application of Time-Frequency (T-F) masking based approaches to Automatic Speech Recognition has been shown to provide significant gains in system performance in the presence of additive noise. Such approaches give performance improvements when the T-F masking front-end is trained jointly with the acoustic model. However, such systems still rely on a pre-trained T-F masking enhancement block, trained using pairs of clean and noisy speech signals. Pre-training is necessary due to the large number of parameters associated with the enhancement network. In this paper, we propose a flat-start joint training of a network that has both a T-F masking based enhancement block and a phoneme classification block. In particular, we use a fully convolutional network as the enhancement front-end to reduce the number of parameters. We train the network by jointly updating the parameters of both these blocks using tied Context-Dependent phoneme states as targets. We observe that pre-training of the proposed enhancement block is not necessary for convergence. In fact, the proposed flat-start joint training converges faster than the baseline multi-condition trained model. Experiments performed on the Aurora-4 database show a 7.06% relative improvement over the multi-conditioned baseline. We obtain similar improvements for unseen test conditions as well.
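
A minimal sketch of the flat-start joint model, assuming PyTorch and illustrative layer sizes: a small fully convolutional front-end predicts a T-F mask, the mask is applied to the noisy features, and a senone classifier consumes the enhanced features, so a single cross-entropy loss updates both blocks from random initialization.

```python
import torch
import torch.nn as nn

class MaskingPlusAM(nn.Module):
    """Flat-start joint model: a fully convolutional front-end predicts a
    T-F mask, the mask is applied to the noisy features, and a classifier maps
    the enhanced features to senone posteriors; one CE loss trains both blocks."""

    def __init__(self, n_feats=40, n_senones=2000):
        super().__init__()
        self.mask_net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, kernel_size=3, padding=1), nn.Sigmoid())
        self.classifier = nn.Sequential(
            nn.Linear(n_feats, 1024), nn.ReLU(), nn.Linear(1024, n_senones))

    def forward(self, noisy):  # noisy: (B, T, n_feats)
        mask = self.mask_net(noisy.unsqueeze(1)).squeeze(1)
        enhanced = mask * noisy  # element-wise T-F masking
        return self.classifier(enhanced)  # (B, T, n_senones)

# training step sketch:
#   logits = model(noisy_feats)
#   loss = torch.nn.functional.cross_entropy(logits.transpose(1, 2), senone_labels)
```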

#19 Speaker-Invariant Feature-Mapping for Distant Speech Recognition via Adversarial Teacher-Student Learning [PDF] [Copy] [Kimi1]

Authors: Long Wu ; Hangting Chen ; Li Wang ; Pengyuan Zhang ; Yonghong Yan

Feature mapping (FM) jointly trained with an acoustic model (AFM) is commonly used for single-channel speech enhancement. However, its performance is affected by inter-speaker variability. In this paper, we propose speaker-invariant AFM (SIAFM), aiming at curtailing inter-talker variability while achieving speech enhancement. In SIAFM, a feature-mapping network, an acoustic model and a speaker classifier network are jointly optimized to minimize the feature-mapping loss and the senone classification loss, and simultaneously min-maximize the speaker classification loss. Evaluated on the AMI dataset, the proposed SIAFM achieves 4.8% and 7.0% relative word error rate (WER) reduction in the overlapped and non-overlapped conditions over the baseline acoustic model trained with single distant microphone (SDM) data. Additionally, SIAFM obtains 3.0% relative overlapped and 4.2% relative non-overlapped WER reduction over the multi-condition trained (MCT) acoustic model. To further improve the performance of SIAFM, we employ teacher-student (TS) learning, in which the posterior probabilities generated from the individual headset microphone (IHM) data are used in lieu of labels to train the SIAFM model. The experiments show that, compared with the MCT model, SIAFM with TS (SIAFM-TS) achieves 4.2% relative overlapped and 6.3% relative non-overlapped WER reduction, respectively.

#20 Full-Sentence Correlation: A Method to Handle Unpredictable Noise for Robust Speech Recognition [PDF] [Copy] [Kimi1]

Authors: Ji Ming ; Danny Crookes

We describe the theory and implementation of full-sentence speech correlation for speech recognition, and demonstrate its superior robustness to unseen/untrained noise. For the Aurora 2 data, trained with only clean speech, the new method performs competitively against the state-of-the-art with multicondition training and adaptation, and achieves the lowest word error rate in very low SNR (-5 dB). Further experiments with highly nonstationary noise (pop song, broadcast news, etc.) show the surprising ability of the new method to handle unpredictable noise. The new method adds several novel developments to our previous research, including the modeling of the speaker characteristics along with other acoustic and semantic features of speech for separating speech from noise, and a novel Viterbi algorithm to implement full-sentence correlation for speech recognition.

#21 Generative Noise Modeling and Channel Simulation for Robust Speech Recognition in Unseen Conditions [PDF] [Copy] [Kimi1]

Authors: Meet Soni ; Sonal Joshi ; Ashish Panda

Multi-conditioned training is a state-of-the-art approach to achieve robustness in Automatic Speech Recognition (ASR) systems. This approach works well in practice for seen degradation conditions. However, the performance of such systems is still an issue for unseen degradation conditions. In this work, we consider distortions due to additive noise and channel mismatch. To achieve robustness to additive noise, we propose a parametric generative model for noise signals. By changing the parameters of the proposed generative model, various noise signals can be generated and used to develop a multi-conditioned dataset for ASR system training. The generative model is designed to span the feature space of Mel Filterbank Energies by using band-limited white noise signals as a basis. To simulate channel distortions, we propose to shift the mean of the log spectral magnitude using utterances with estimated channel distortions. Experiments performed on the Aurora 4 noisy speech database show that using noise types generated from the proposed generative model for multi-conditioned training provides significant performance gains for additive noise in unseen conditions. We compare our results with those from multi-conditioning using various real noise databases, including environmental and other real-life noises.
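
The band-limited white-noise basis idea can be sketched with NumPy as follows: synthesize noise restricted to a few frequency bands by zeroing FFT bins, then mix the bands with randomly sampled weights to obtain a new synthetic noise type. The band edges and weight distribution here are illustrative assumptions, not the paper's parameterization.

```python
import numpy as np

def band_limited_noise(n_samples, low_hz, high_hz, sr=16000, rng=None):
    """White noise restricted to [low_hz, high_hz] by zeroing FFT bins."""
    rng = rng or np.random.default_rng()
    spec = np.fft.rfft(rng.standard_normal(n_samples))
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / sr)
    spec[(freqs < low_hz) | (freqs > high_hz)] = 0.0
    return np.fft.irfft(spec, n=n_samples)

def generate_noise(n_samples, band_edges=(0, 1000, 2000, 4000, 8000), sr=16000):
    """Random convex combination of band-limited white-noise 'basis' signals;
    sampling new weights yields a new synthetic noise type for multi-conditioned training."""
    rng = np.random.default_rng()
    weights = rng.dirichlet(np.ones(len(band_edges) - 1))
    bands = [band_limited_noise(n_samples, lo, hi, sr, rng)
             for lo, hi in zip(band_edges[:-1], band_edges[1:])]
    noise = sum(w * b for w, b in zip(weights, bands))
    return noise / (np.abs(noise).max() + 1e-9)  # peak-normalize before mixing
```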

#22 Far-Field Speech Enhancement Using Heteroscedastic Autoencoder for Improved Speech Recognition [PDF] [Copy] [Kimi1]

Authors: Shashi Kumar ; Shakti P. Rath

Automatic speech recognition (ASR) systems trained on clean speech do not perform well in far-field scenarios. The degradation in word error rate (WER) can be as large as 40% in this mismatched scenario. Typically, speech enhancement is applied to map speech from the far-field condition to the clean condition using a neural network, commonly known as a denoising autoencoder (DA). Such speech enhancement techniques have shown significant improvements in ASR accuracy. It is common practice to train the DA with a mean-square error (MSE) loss, which corresponds to a regression model whose residual noise is assumed to be Gaussian with zero mean and constant covariance. However, neither assumption is optimal, especially in highly non-stationary noisy and far-field scenarios. Here, we propose a more general loss based on a non-zero-mean, heteroscedastic-covariance distribution for the residual variables. On top of this, we present several novel DA architectures that are more suitable for the heteroscedastic loss. It is shown that the proposed methods outperform the conventional DA and MSE loss by a large margin. We observe a relative improvement of 7.31% in WER compared to the conventional DA and, overall, a relative improvement of 14.4% compared to the mismatched train and test scenario.
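
A hedged sketch of the heteroscedastic idea, assuming PyTorch: the enhancement network predicts both a mean and a per-dimension log-variance for the clean features and is trained with a Gaussian negative log-likelihood instead of plain MSE, so residuals with larger predicted variance are down-weighted. The architecture details are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class HeteroscedasticDA(nn.Module):
    """Denoising autoencoder that predicts both the clean-feature mean and a
    per-dimension log-variance, trained with a Gaussian negative log-likelihood
    instead of plain MSE (which assumes zero-mean, constant-variance residuals)."""

    def __init__(self, feat_dim=40, hidden=1024):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.mean = nn.Linear(hidden, feat_dim)
        self.log_var = nn.Linear(hidden, feat_dim)

    def forward(self, noisy):
        h = self.body(noisy)
        return self.mean(h), self.log_var(h)

def heteroscedastic_nll(mean, log_var, clean):
    """0.5 * (log sigma^2 + (clean - mean)^2 / sigma^2), averaged over all elements."""
    return 0.5 * (log_var + (clean - mean) ** 2 / log_var.exp()).mean()
```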

#23 End-to-End SpeakerBeam for Single Channel Target Speech Recognition [PDF] [Copy] [Kimi1]

Authors: Marc Delcroix ; Shinji Watanabe ; Tsubasa Ochiai ; Keisuke Kinoshita ; Shigeki Karita ; Atsunori Ogawa ; Tomohiro Nakatani

End-to-end (E2E) automatic speech recognition (ASR) that directly maps a sequence of speech features into a sequence of characters using a single neural network has received a lot of attention, as it greatly simplifies the training and decoding pipelines and enables optimizing the whole system E2E. Recently, such systems have been extended to recognize speech mixtures by inserting a speech separation mechanism into the neural network, allowing them to output recognition results for each speaker in the mixture. However, speech separation suffers from a global permutation ambiguity issue, i.e. an arbitrary mapping between source speakers and outputs. We argue that this ambiguity would seriously limit the practical use of E2E separation systems. SpeakerBeam has been proposed as an alternative to speech separation to mitigate the global permutation ambiguity. SpeakerBeam aims at extracting only a target speaker from a mixture based on his/her speech characteristics, thus avoiding the global permutation problem. In this paper, we combine SpeakerBeam and an E2E ASR system to allow E2E training of a target speech recognition system. We show promising target speech recognition results in mixtures of two speakers, and discuss interesting properties of the proposed system in terms of speech enhancement and diarization ability.

#24 NIESR: Nuisance Invariant End-to-End Speech Recognition [PDF] [Copy] [Kimi1]

Authors: I-Hung Hsu ; Ayush Jaiswal ; Premkumar Natarajan

Deep neural network models for speech recognition have achieved great success recently, but they can learn incorrect associations between the target and nuisance factors of speech (e.g., speaker identities, background noise, etc.), which can lead to overfitting. While several methods have been proposed to tackle this problem, existing methods incorporate additional information about nuisance factors during training to develop invariant models. However, enumerating all possible nuisance factors in speech data and collecting their annotations is difficult and expensive. We present a robust training scheme for end-to-end speech recognition that adopts an unsupervised adversarial invariance induction framework to separate out essential factors for speech recognition from nuisances, without using any supplementary labels besides the transcriptions. Experiments show that the speech recognition model trained with the proposed training scheme achieves relative improvements of 5.48% on WSJ0, 6.16% on CHiME3, and 6.61% on TIMIT over the base model. Additionally, the proposed method achieves a relative improvement of 14.44% on the combined WSJ0+CHiME3 dataset.

#25 Knowledge Distillation for Throat Microphone Speech Recognition [PDF] [Copy] [Kimi1]

Authors: Takahito Suzuki ; Jun Ogata ; Takashi Tsunakawa ; Masafumi Nishida ; Masafumi Nishimura

Throat microphones are robust against external noise because they receive vibrations directly from the skin; however, the available speech data for them is limited. This work aims to improve the speech recognition accuracy of throat microphones, and we propose a knowledge distillation method for hybrid DNN-HMM acoustic models. This method distills knowledge from an acoustic model trained with a large amount of close-talk microphone speech data (teacher model) into an acoustic model for throat microphones (student model), using a small amount of parallel data from throat and close-talk microphones. The front-end network of the student model is a feature mapping network from throat microphone acoustic features to close-talk microphone bottleneck features, and the back-end network is a phonetic discrimination network operating on close-talk microphone bottleneck features. We attempted to improve recognition accuracy further by initializing the student model parameters using pretrained front-end and back-end networks. Experimental results on Japanese read speech data showed that the proposed approach achieved a 9.8% relative improvement in character error rate (14.3% → 12.9%) compared to the hybrid acoustic model trained only with throat microphone speech data. Furthermore, under noise environments of approximately 70 dBA or higher, the throat microphone system with our approach outperformed the close-talk microphone system.
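
A typical teacher-student objective for this setting, given parallel throat / close-talk frames, combines a softened-posterior KL term with a feature-mapping term toward the teacher's bottleneck features; the PyTorch sketch below is a generic version of that recipe, with the temperature and weighting chosen for illustration rather than taken from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, student_bn, teacher_bn,
                      temperature=2.0, beta=0.5):
    """Teacher-student loss on parallel throat / close-talk frames:
    KL between softened senone posteriors plus an MSE term pulling the
    student's front-end output toward the teacher's bottleneck features."""
    kd = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                  F.softmax(teacher_logits.detach() / temperature, dim=-1),
                  reduction="batchmean") * temperature ** 2
    fm = F.mse_loss(student_bn, teacher_bn.detach())
    return kd + beta * fm
```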